Abstract:

We  aim  at  establishing  a  theoretical  framework  for  building  a  common  ground  that  may 
allow  for  content-rich,  proficient  and  reliable  communication  between  people  and  robots.  To 
solicit  high  quality  and  spontaneous  contribution  from  people,  we  need  to  make  the  process 
of  common  ground  acquisition  as  playful  as  possible,  by  incorporating  aspects  of  game, 
story  and  play.  The  notion  of  interaction  game  is  critical  to  induce  a  strong  engagement  of 
participants.  We  addressed  learning  from  demonstration  for  acquiring  interaction  patterns, 
cognitive  aspects  of  interaction  game  and  an  integrated  framework  for  interaction  game 
design. Major results encompass a robust algorithm for learning from demonstration, SAXImitate;
an integrated toolbox, MC2 (Motif Change and Causality Discovery); the implementation and
evaluation of a virtual basketball game; an investigation of the effect of back imitation; a
method for inducing intentional stance in HAI (Human-Agent Interaction); the use of
physiological indices for discriminating intrinsic and extrinsic stress; and SES (Synthetic
Evidential Study).

1.  Background 

In the AI era, human-AI symbiosis is among the most critical issues. The question is how to
achieve it. Our approach rests on the sharing hypothesis: the more is shared among
the participants, the more empathy is gained. We believe that common ground, defined as
the presuppositions for conversation that each participant is supposed to share about
surroundings, activities, perceptions, emotions, plans, interests, etc., is critical for
establishing fluent and reliable communication between humans and robots.

Common ground is either communal, consisting of human nature, communal lexicons, and
cultural facts, norms, and procedures, or personal, consisting of perceptual bases, gestural
indications, partner's activities, salient perceptual events, actional bases, and personal
diaries (Clark 1986). Despite its importance in communication, common ground is not easy to
build. Hand crafting is too expensive. Machine learning is necessary but not
sufficient, as it will not automatically create knowledge from scratch. Given the current
state of the art, human computing is considered a reasonable approach to breaking
through this limitation. To solicit high-quality and spontaneous contributions from
people, we need to make the process of common ground acquisition as playful as possible,
by incorporating aspects of game, story and play. The notion of an interaction game, named
after Wittgenstein's language game, is critical to inducing strong engagement from
participants.

Our  research  draws  on  conversational  informatics  which  focuses  on  investigating  human 
behaviors and designing artifacts that can interact with people in a conversational fashion.
The  field  draws  on  a  foundation  provided  by  artificial  intelligence,  natural  language 
processing,  speech  and  image  processing,  cognitive  science,  and  conversation  analysis.  It 
aims  to  shed  light  on  meaning  creation  and  interpretation  during  conversation,  in  search  of 
better  methods  of  computer-mediated  communication,  human-computer  interaction,  and 
support for knowledge creation. We have been working on the idea since the early 2000s and
published it with Wiley in 2007, following an AISB workshop in 2005.

The primary theoretical backbone is conversation quantization, which integrates the interaction and
descriptive aspects of conversation. Conversation quantization is a theory for realizing
conversational intelligence based on the idea of conversation quanta. A conversation
quantum  represents  an  ideal  observer's  record  of  conversation.  It  packages  relevant 
participants,  references  to  the  objects  and  events  discussed  in  the  discourse,  a  series  of 
verbal  and  nonverbal  utterances  exchanged  by  the  participants,  commitments  to  previous 
discourse  (themes),  and  new  propositions  in  the  discourse  (rhemes).  In  the  conceptual 
framework  of  conversational  quanta,  a  conversation  quantum  takes  a  form  of  dictation  from 
the  observer's  viewpoint  or  a  script  to  be  played  in  the  forthcoming  sessions  of  conversation. 
We  assume  that  intelligent  actors  such  as  humans  can  both  create  a  collection  of 
conversational  quanta  on  the  fly  while  they  are  talking  in  conversation  (materialization),  and 
enable  conversational  interactions  based  on  a  collection  of  conversational  quanta  as  a 
prototype (dematerialization). Memory processes take such quanta and accumulate them into the
structure of memory, as well as retrieve them from memory on demand. Long-term memory
processes  will  intervene  to  generalize  and  reorganize  accumulated  conversational  quanta. 
Some  of  them  may  be  generalized  and  stored  as  a  part  of  generic  knowledge  structure  to  be 
applied  to  broader  discourses,  while  others  may  be  indexed  as  a  less  interesting  event  in  the 
episodic  memory.  Conversation  itself  is  regarded  as  a  part  of  a  larger  flow  of  conversational 
quanta  in  a  collective  dynamic  memory  process,  just  as  Clark  characterized  conversation  as 
a  means  for  achieving  a  joint  project. 
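
As a minimal sketch of the packaging described above, a conversation quantum could be represented as a small record type. Everything named here (the `ConversationQuantum` and `Utterance` classes, their fields, and the `as_script` helper) is a hypothetical illustration and not the authors' implementation; it only mirrors the components listed in the text: participants, references, utterances, themes, and rhemes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Utterance:
    """One verbal or nonverbal utterance (hypothetical structure)."""
    speaker: str
    modality: str  # e.g. "verbal" or "nonverbal"
    content: str


@dataclass
class ConversationQuantum:
    """An ideal observer's record of a piece of conversation (illustrative only)."""
    participants: List[str]
    references: List[str]  # objects and events discussed in the discourse
    utterances: List[Utterance] = field(default_factory=list)
    themes: List[str] = field(default_factory=list)  # commitments to previous discourse
    rhemes: List[str] = field(default_factory=list)  # new propositions in the discourse


def as_script(quantum: ConversationQuantum) -> List[str]:
    """Render a quantum as a script that could be replayed in a later session."""
    return [f"{u.speaker} ({u.modality}): {u.content}" for u in quantum.utterances]


quantum = ConversationQuantum(
    participants=["A", "B"],
    references=["ball"],
    utterances=[Utterance("A", "verbal", "Pass it to me"),
                Utterance("B", "nonverbal", "nods")],
    themes=["A and B are on the same team"],
    rhemes=["A wants the ball"],
)
print(as_script(quantum))
```

Under this assumption, building such records while talking corresponds to what the text calls materialization, and replaying them as a script corresponds to dematerialization.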

Conversational  informatics  depends  on  four  major  techniques.  The  first  is  about 
conversational  artifacts  that  can  participate  in  conversations  with  people.  The  second  is 
about  conversational  contents  that  can  encapsulate  information  and  knowledge  arising  in  a 
conversational  situation  for  reusing  it  in  a  new  conversational  situation.  The  third  is  about 
conversation  environment  design  whose  goal  is  to  build  a  complete  space  that  can  provide 
participants  with  proper  resources  in  conversation  to  enable  smooth  and  effective  interaction. 
The  last  technique  is  about  conversation  measurement,  analysis,  and  modeling.  We  take  a 
data-driven  quantitative  approach  to  understand  conversational  behaviors  by  measuring 
conversational  behaviors  using  advanced  sensing  technologies  and  thereby  aim  at  building 
detailed  quantitative  models  of  conversation. 

Our research group has developed a rather comprehensive suite of techniques for
conversational informatics, ranging from analysis to application. Our approach has been
based on the sharing hypothesis: "the more is shared among the participants, the more
empathy is gained." We have not tried to aim for the ultimate success, the realization of a
comprehensive artificial mind, from the beginning. Instead, we have taken the
state of the art in conversational informatics as a basis and pursued an empirical
approach of repeatedly increasing the common ground shared between humans and
artificial agents. Among others, the suite includes multiple-sensor-based 3D conversation
measurement and analysis suites for the empirical study of social signal processing involving
human-robot and even human-human interactions; an immersive Wizard-of-Oz (WOZ)
environment for human-robot interaction design and evaluation; a cyber-physical conversation
environment; learning by imitation as a data-intensive means of developing a robot's competence
in interaction; interactive joint intention formation as collaboration through conversation;
and simulated people as a means for investigating real-time collaboration between humans and
robots.

Common ground has been one of the central issues in conversational informatics. Although
the  idea  of  co-evolution  of  common  ground  and  conversational  intelligence  was  introduced 
to  boot-strap  "the  primordial  soup  of  conversation"  as  a  shared  knowledge  source  for  a 
conversational  community  of  people  and  agents,  its  substantial  implementation  has  not  been 
conducted  yet,  due  to  the  lack  of  a  methodology. 

2.  Goal  and  Approach 

The  primary  goal  of  this  research  is  to  establish  a  conceptual  framework  that  can  serve  as  a 
dependable  theoretical  basis  for  building  a  common  ground.  Toward  this  end,  we  build  on 
conversational  informatics,  by  addressing  several  technical  aspects  that  we  believe  are 
critical  to  common  ground  building. 

The  first  is  learning  by  active  imitation  which  allows  robots  to  engage  in  interactions  with 
people  and  other  robots  to  share  information,  based  on  a  collection  of  data  obtained  from 
human-agent  interaction  in  the  WOZ  experiment.  It  implements  the  idea  that  the  learning 
robot  "watches"  how  people  interact  with  each  other,  estimates  how  the  target  actor  reacts 
according  to  the  communicative  behavior  of  the  communication  partner,  and  applies  the 
acquired knowledge as estimated patterns of actions to the actual situations it encounters
in conversation. So far, we have developed a suite of unsupervised learning algorithms for
this framework. On top of it, we aim at adding autonomy and fluidity to
learning by imitation. The major ideas are the introduction of a top-down
hierarchical feature and a developmental computational mechanism.

The  second  is  estimation  of  the  user's  mental  status  by  integrating  audio-visual  cues  and 
physiological  indices.  An  ability  to  estimate  the  user's  emotion  and  stress  is  considered  to  be 
mandatory  for  building  fellow  agents  that  can  build  and  sustain  an  almost  equal  relationship 
with  the  user  by  exploiting  the  common  ground  depending  on  the  situation  including  the 
mental status of the user. A more complex method of estimation is needed in building a
facilitator agent that can help a group of users to discuss in an effective and productive
fashion. The dynamic nature of the user's intention needs to be taken into account. Physiological
indices may be used to obtain more reliable estimates when conducting experiments, although they
may not be usable in normal interactions.

The  third  is  an  integrated  framework  for  interaction  game  design.  We  consider  that 
technology-enhanced  combination  of  dramatic  role  play  and  group  discussions  is  promising, 
as  the  latter  allows  most  people  to  participate  in  and  contribute  to  the  activity,  the  former 
permits  each  participant  to  reorganize  knowledge  from  the  viewpoint  of  a  participating  actor, 
while  technology-enhancement  is  mandatory  to  make  the  group  activity  productive.  To 
substantiate  the  idea,  we  have  set  out  the  conceptual  framework  of  synthetic  evidential 
study  (SES),  a  novel  technology-enhanced  methodology  that  combines  dramatic  role  play 
and group discussion to help people learn by spinning stories comprised of partial thoughts
and evidence. We have completed an initial feasibility study on the effect of SES and worked
out an architecture for an SES support system on the basis of conversational informatics. We
believe  that  SES  is  a  powerful  and  practical  paradigm  of  human  computing  for  building  a 
common  ground. 

Our approach to learning by imitation and to the estimation of mental status is highly
empirical in nature, though a theoretical approach is needed to stabilize the basis and
integrate the insights obtained from experiments.

3.  Results 

Major results encompass a robust algorithm for learning from demonstration, SAXImitate; an
integrated toolbox, MC2 (Motif Change and Causality Discovery); the implementation and
evaluation of a virtual basketball game; an investigation of the effect of back imitation; a
method for inducing intentional stance in HAI (Human-Agent Interaction); the use of
physiological indices for discriminating intrinsic and extrinsic stress; and SES (Synthetic
Evidential Study).

DISTRIBUTION A. Approved for public release: distribution unlimited.

3.1 Learning from Demonstration for Acquiring Interaction Patterns

3.1.1  Robust  Learning  from  Demonstrations  using  Multidimensional  SAX 

We have developed a novel LfD (Learning from Demonstration) system based on a
symbolization approach. It uses the Symbolic Aggregate approXimation (SAX) time-series
symbolization algorithm, extended to handle the multi-dimensional time series found in
LfD tasks. The proposed method, called SAXImitate, can be used as a stand-alone LfD
system. SAXImitate is more resistant to the confusing demonstrations that usually arise when
action segmentation is automated, and it can cope with difficulties such as data
scarcity, distortions, and confusions. Alternatively, it can be used as a filtering step that
provides better demonstration(s) to any other LfD system, taking advantage of any superior
encoding utility of that system that may be relevant to a specific LfD task.

This allows the proposed method to combine its particular noise resistance
with the advantages of other LfD systems. The proposed system is compared to
representative implementations of two of the most widely used LfD approaches, namely
Dynamic Motor Primitives (DMP) and probabilistic modeling, exemplified by the Gaussian
Mixture Modeling/Gaussian Mixture Regression (GMM/GMR) approach, in learning basic
motions from the CMU Mocap database as well as learning 2D drawings from
demonstrations.
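
To make the symbolization step concrete, the following is a minimal sketch of standard SAX (z-normalization, piecewise aggregate approximation, and discretization with equiprobable Gaussian breakpoints). The function names are hypothetical, and applying SAX independently to each dimension in `sax_multidim` is an assumption made for illustration; the actual multi-dimensional extension used in SAXImitate is not specified here.

```python
from statistics import NormalDist, mean, pstdev


def znormalize(series):
    """Rescale a series to zero mean and unit variance."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


def paa(series, n_segments):
    """Piecewise Aggregate Approximation: the mean of each of n_segments frames."""
    n = len(series)
    if n_segments < 1 or n_segments > n:
        raise ValueError("n_segments must be between 1 and len(series)")
    return [mean(series[i * n // n_segments:(i + 1) * n // n_segments])
            for i in range(n_segments)]


def breakpoints(alphabet_size):
    """Cut points that split the standard normal into equiprobable regions."""
    nd = NormalDist()
    return [nd.inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]


def sax_word(series, n_segments, alphabet_size):
    """Convert a one-dimensional series into a SAX word over 'a', 'b', ..."""
    cuts = breakpoints(alphabet_size)
    symbols = []
    for value in paa(znormalize(series), n_segments):
        index = sum(value > c for c in cuts)  # number of cut points below value
        symbols.append(chr(ord("a") + index))
    return "".join(symbols)


def sax_multidim(dimensions, n_segments, alphabet_size):
    """Naive extension (an assumption): one SAX word per dimension."""
    return [sax_word(d, n_segments, alphabet_size) for d in dimensions]


print(sax_word([1, 2, 3, 4, 5, 6, 7, 8], n_segments=4, alphabet_size=4))
```

With these settings, a steadily rising series maps to the word `abcd`, and its reverse maps to `dcba`. Once demonstrations are reduced to short words, they can be compared symbolically.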

3.1.2 Integrated Toolbox MC2

To  package  tools  into  a  suite,  we  developed  an  integrated  toolbox  MC2  (Motif  Change  and 
Causality  Discovery)  that  implements  several  state-of-the-art  approaches  to  core  problems  in 
time-series  analysis.  MC2  includes  MD  (Motif  Discovery),  CPD  (Change  Point  Discovery),  and 
CID  (Causality  Influence  Discovery)  for  several  applications  in  human  behavior 
understanding,  physiological  signal  processing,  forecasting,  analysis  of  climate  data, 
cognitive modeling, etc. MC2 implements multiple approaches to each of the three main
problems of discovery in time series and provides evaluation routines to compare their
adequacy for different mining tasks. Showcase real-world applications include gesture
discovery from accelerometer data and fluid imitation in human-robot interaction.
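
As an illustration of what change point discovery computes, the sketch below scores each time step by the absolute difference between the means of the windows immediately before and after it. This is a deliberately simple stand-in chosen for clarity; it is not one of the algorithms implemented in MC2, and the function names and the `window` parameter are hypothetical.

```python
from statistics import mean


def change_scores(series, window):
    """Score each index by how much the local mean shifts across it."""
    scores = [0.0] * len(series)
    for t in range(window, len(series) - window + 1):
        before = series[t - window:t]
        after = series[t:t + window]
        scores[t] = abs(mean(after) - mean(before))
    return scores


def top_change_point(series, window):
    """Return the index with the highest change score."""
    scores = change_scores(series, window)
    return max(range(len(scores)), key=scores.__getitem__)


signal = [0.0] * 10 + [5.0] * 10
print(top_change_point(signal, window=3))
```

For the step signal above, the highest score falls at index 10, where the level shifts. More elaborate methods differ mainly in how they measure the dissimilarity between the two windows.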

3.2 Cognitive Aspects of Interaction Game

3.2.1  User  Perceptions  of  Communicative  and  Task-competent  Agents  in  a  Virtual 
Basketball  Game 

We take the basketball game as a typical example of a multi-player, real-time, situated game
in which collaboration and competence are deemed critical factors, as well as physical
performance. We are attempting to model the basketball game using Clark's joint activity
theory. We characterize the top level of the basketball game as a collection of joint projects,
including those for getting the ball into the opposition's hoop and stopping the other team
from scoring. We model lower levels using the notion of joint projects for passing and
catching the ball, or running certain plays. We are attempting to identify the various signals
players use intentionally to communicate with each other. For example, a player may look
towards a teammate to indicate a pass, make a specific gesture indicating that a certain play
is to be executed, and so on. There are many situational gestures which may be executed at
various points in the game and ascribed meaning. While simple signals such as pointing are
understandable to a majority of people, complex signals may be team-specific and therefore
impossible for those outside the team to interpret. We are attempting to capture the
modalities players use, explicitly or implicitly, to send signals. Verbal utterances may be
used to direct players, though these are also received by members of the opposing team. As
basketball is a fast-paced sport, complex strategies simply cannot be executed
through utterances alone. Players may look to non-verbal cues in order to determine
the next course of action. These cues can be the state of the game itself (i.e., the spatial
positions of players on the court) or signals performed by players in the game. Players also
use rich body expressions.

As a research platform, we prototyped a virtual basketball game that, when completed, will be
played in cyberspace by an ensemble of humans, avatars, and agents. We assume
that the human players are familiar with the objective and rules of the basketball game. The
user should be able to interact with their agent partners to create the kind of
understanding and intuition found in real-world sports teams. Each player in
an immersive interaction environment is given a first-person view or a third-person view with a
tracking camera behind the user's avatar. We designed the virtual basketball game so that
users can control an avatar, perform basketball gestures and navigate the court
without the need for hand-held peripherals. We incorporate into agents an ability to sustain
proper social interactions with people. We extend the joint activity theory of H. Clark as a
basis for designing and understanding interaction games.

We evaluated our joint activity theory-based agent model for the communication-competent
agent. We compared people's perceptions of an agent teammate with higher basketball
ability against one with higher communication ability. We found that people were able to
distinguish between the two agents and preferred the one with higher communication
ability, but there was no difference in the perceived intelligence of the agents. This
suggests that users prefer communication ability to task ability in this environment, though
this could largely be due to the nature of the game itself.

3.2.2 Effect of Back Imitation on Judgment of Imitative Skill

We have conducted three studies to explore the effects of imitating a robot (back imitation)
on humans' perception of this robot in terms of imitative skill, interaction quality, humanness
of the robot and intention of future interaction. The results of these studies taken together
show that subjects preferred the robot that they had previously imitated, in terms of imitation
skill, naturalness, and motion human-likeness, over the robot that they had not
imitated. There was no difference between the simple back imitation and the more complex
mutual imitation conditions in the main study. This result supports the use of a back
imitation familiarization session before learning from demonstration sessions. More generally,
this result emphasizes the importance of taking the interaction context (and history) into
consideration when attributing differences in people's responses to robots.

3.2.3 Induction of Intentional Stance in HAI by Presenting Goal-Oriented
Behavior using Multimodal Information

The presence of fellow agents is critical for provoking the users' serious engagement, which
makes the human-agent interaction meaningful and productive and contributes to cultivating the
common ground. We address conditions for motivating people to hold not only a positive
feeling but also an intentional stance toward agents in the interaction game environment. In
order to endow the agents with the ability to interact emotionally and/or intentionally with
the user, we investigated cognitive aspects that may enhance the perception of agents as
independent beings.

We focused on "intentional stance". Intentional stance is a mental state in which we think
that  an  interaction  partner  has  intention.  We  hypothesized  that  agents  could  induce  the 
intentional  stance  by  performing  goal-oriented  actions  in  human-agent  interaction.  To 
investigate  the  effect  of  induction  of  intentional  stance,  we  made  two  agents:  a 
"trial-and-error  agent"  that  performed  goal-oriented  actions  using  multimodal  behavior  and  a 
"text  display  agent"  that  displayed  its  behavioral  intention  via  text.  We  conducted  an 
experiment  in  which  two  participants  played  customized  tag  in  virtual  reality  with  one  of  the 
agents.  The  results  showed  that  participants  continuously  tried  to  communicate  with  the 
trial-and-error  agent,  which  did  not  respond  to  the  participant's  actions  except  when 
necessary  for  performing  the  task.  We  found  that  the  participants  felt  that  the  agent  using 
multimodal  nonverbal  behavior  was  more  goal-oriented,  more  intelligent  and  understood 
their  intentions  more  than  the  agent  that  displayed  text  above  its  head.  Thus,  we  were  able 
to  induce  the  intentional  stance  by  presenting  a  trial-and-error  process  using  multimodal 
behavior.

We also investigated a method for estimating the degree of concentration from
physiological indices during VR exercise games. In addition, we ran a controlled experiment to
confirm whether users' concentration can be maintained by giving advice at a timing determined
from the physiological indices. In the experimental group, the subjective degree of users’
concentration increased. In addition, the number of SCR responses decreased and the number of
LF/HF responses increased. These results suggest the possibility of guiding users toward an
intended state by giving advice at a timing based on the physiological indices.

3.2.4 Using Physiological Indices for Discriminating Intrinsic and Extrinsic Stress

We used the Immersive Collaborative Interaction Environment (ICIE), as it allowed participants
to easily look around in the virtual space much as they would in the real world. The
participant's virtual avatar was intuitively controlled by their body motions using motion
sensors placed on their dominant arm, both feet and waist. To determine the participant's inner
state, exercise quantity was estimated from the stepping motion, and SCR and LF/HF were
recorded by a device for measuring physiological indices. This information was sent to the game
system in real time. To distinguish intrinsic and extrinsic stress, the numbers of SCR and
LF/HF responses were counted.
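
The counting step can be sketched as follows, assuming that SCR denotes skin conductance responses and that a response is registered each time the recorded signal rises above a fixed threshold. Both the detection rule and the `threshold` value are assumptions made for illustration; the criterion actually used in the experiments is not given here.

```python
def count_responses(signal, threshold):
    """Count upward crossings of `threshold`; each crossing is one response."""
    count = 0
    above = False
    for value in signal:
        if value > threshold and not above:
            count += 1  # the signal has just risen above the threshold
        above = value > threshold
    return count


samples = [0.0, 0.1, 0.6, 0.7, 0.2, 0.1, 0.8, 0.3]
print(count_responses(samples, threshold=0.5))
```

In this toy sequence the signal rises above 0.5 twice, so two responses are counted. Counting per condition or per time window then gives the numbers compared in the analysis.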

We developed a method for discriminating whether changes in a human's
activities are caused by intrinsic factors related to spontaneous mental activities or by
extrinsic stimuli generated by circumstances in continuous interaction, based on physiological
indices and game context. A key to building an empathetic pedagogical agent is to endow it with
the ability to estimate students' inner states, so that it can evaluate the adequacy of an
interaction agent's behavior and the difficulty of the task set for the user.

We found that the SCR responses changed in opposite directions between the experimental
conditions, which differed in the inner state of the participant when the agent provided
advice. This indicates that the effect of the advice differed depending on the inner state of
the participant, and suggests that advice from the agent cannot produce adequate learning
effects without appropriate evaluation of a person's inner state.

3.3 Synthetic Evidential Study: an Integrated Framework

Synthetic evidential study (SES for short) is a novel technology-enhanced methodology that
combines theatrical role play and group discussion to help people spin stories by bringing
together partial thoughts and evidence.

The SES framework consists of the SES sessions and the interpretation archives. In each SES
session, participants repeat a cycle of a theatrical role play, its projection into an annotated
agent play, and a group discussion. One or more successive SES sessions, executed until the
participants reach (temporary) satisfaction, are called an SES workshop. In the theatrical
role play phase, participants play their respective roles to demonstrate their first-person
interpretation in a virtual space. It allows them to interpret the given subject from the
viewpoint of an assigned role. In the projection phase, an annotated agent play on a game
engine is produced from the theatrical role play of the previous phase by applying the oral
edit commands (if any) to the theatrical actions elicited from all the behaviors of the actors.
We employ the annotated agent play for reuse, refinement, and extension in later SES
sessions. In the critical discussion phase, the participants or other audience members share,
for criticism, the third-person interpretation played by the actors. The actors revise the
virtual play until they are satisfied. The understanding of the given theme is progressively
deepened by repeatedly looking at the embodied interpretation from the first- and third-person
views. The interpretation archive logistically supports the SES sessions. The annotated agent
plays and stories resulting from SES workshops may be decomposed into components for
later reuse, so that participants in subsequent SES workshops can adapt previous annotated
agent plays and stories for use as part of the present annotated agent play.

SES leverages powerful game technologies to engage participants in a collaborative study on
unveiling mysteries and other kinds of social processes by combining dramatic role play,
agent play and group discussions to spin a story as a joint interpretation. We consider
SES a promising approach to building common ground in a cost-effective fashion. The SES
support system is indispensable for making the SES methodology accessible to everybody. Work is
in progress to build the SES support system by combining technologies that Nishida's group
developed for conversational informatics research. It consists of an immersive interaction and
collaboration environment for the shared virtual world, dramatic group play capture, agent
play creation, discussion support, and conversation quantization. Learning by imitation and
cognitive interaction design, which Nishida's group is addressing, will make the SES support
system even more powerful. SES is potentially particularly useful in collaborative
learning and in the authoring of interactive drama in storytelling, which can be used in broad
areas ranging from education and training to planning and simulation.

Prototyping  an  SES  support  system  started  in  2014,  by  incorporating  a  framework  of 
interaction  game,  learning  by  imitation,  and  cognitive  interaction  design.  We  have  extended 
the  prototype  of  the  SES  support  system  built  in  the  first  year  in  several  ways.  First,  we 
prototyped  components  for  the  SES  users  to  exploit  the  interpretation  archive  to  share  and 
reuse  annotations.  Second,  we  introduced  a  light-weight  tool  for  building  the  virtual  space 
that may encompass virtual actors. Third, we developed a distributed shared environment
for experimenting with interactions as a virtual participant from the first-person perspective.

We conducted a feasibility study with a partially implemented SES support system. There
were numerous encouraging results regarding the typical behavioral patterns of participants
in role playing and how their inside understanding was deepened through the role play and
discussion in the SES framework. We have conducted several experiments, including one on
incrementally accumulating interpretations of a passage of a novel "in the wood", a classic
suspense story in which each character's intentions tangle in a complex fashion. Participants
were asked to annotate scenes by adding comments of eight types: clarification,
confirmation, empathy, conjecture, doubt, question, and surprise. The experiment consisted
of three stages. At the first stage, the participants had to start from no annotations, whereas
at the second and third stages, they were allowed to refer to the annotations produced by
their predecessors. Annotations are structured as an interpretation network in terms of
dependency between different stages.

To leverage the power of the SES framework in social interaction analysis and design, we
prototyped a shared virtual space around a ticket office where avatars of participants in
different roles, such as a counter person and a customer, interact with each other to perform
role-specific behaviors. We used this virtual setting to conduct a preliminary
experiment to compare the participants' comprehension of given social situations and
preferred behaviors from the first- and third-person views.

We have found that the SES framework effectively helped participants to formulate a
coherent and consistent understanding of the given story by sharing articulated
interpretations. According to our preliminary analysis, the participants
focused more on identifying facts and their relations by adding more confirmation-type
annotations at the early stage of interpretation, while they added more conjecture-type
annotations at the later stage. Correlation analysis showed that the correlation
between conjecture and the depth of understanding, measured by self-report in a
questionnaire, is positive at the third stage. Questionnaire analysis also showed
that none of the four participants reported that their opinion changed at the second stage,
while all five participants reported an opinion change at the third stage. This suggests
that the participants started to build their interpretations by themselves. We suspect that
questions and conjectures promote new conjectures. The more conjectures, the fewer questions.
This may promote the deepening of interpretation and change of opinion. Furthermore, the
results suggest that critical comments can be identified based on degree centrality, i.e., the
degree of a node, and that critical comments are clustered in the interpretation network.
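
The degree-based identification of critical comments can be sketched as below, assuming the interpretation network is available as a list of dependency edges between annotation identifiers. The identifiers, the edge list, and the `critical_comments` helper are hypothetical and serve only to show the computation.

```python
from collections import Counter


def degree_centrality(edges):
    """Return the degree (number of incident edges) of every node."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return degree


def critical_comments(edges, top_k):
    """Return the top_k nodes with the highest degree."""
    return [node for node, _ in degree_centrality(edges).most_common(top_k)]


# Hypothetical dependencies: each pair links an annotation to one it refers to.
edges = [("c1", "c2"), ("c1", "c3"), ("c1", "c4"), ("c2", "c3")]
print(critical_comments(edges, top_k=1))
```

Here annotation `c1` takes part in three dependencies and is therefore ranked first. The raw degree is used, matching the definition in the text, rather than a normalized variant.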

From a preliminary experiment with a shared virtual space around a ticket
office, we have found that interpretation of the situation relies on the participant's viewpoint
in the situation. The first-person view seemed to induce a stronger sense of engagement. We
observed that the participants preferred a service counter with their own cultural style to one
with a different cultural style; they followed the social behaviors of their predecessors; and
they preferred a fair service counter that rejects unreasonable line cutting.